Chapter 9: Multimodality

9.1 Overview

9.1.1 Natural Communication with Machines

Author

  • James L. Flanagan
Abstract

Figure 9.11: Generating and Transforming Presentations in Different Modes and Media. (The figure relates visualization components, e.g., ANIMNL, ANTLIMA, SPRINT, and verbalization components, e.g., HAM-ANS, LANDSCAN, VITRA, NAOS, to multimodal presentation systems, e.g., COMET and WIP, all working from a formal representation of the information to be conveyed and producing text, text & graphics, or graphics/images.)

A new generation of intelligent multimodal systems (Maybury, 1993) goes beyond the canned text, predesigned graphics, and prerecorded images and sounds typically found in today's commercial multimedia systems. A basic principle underlying these so-called intellimedia systems is that the various constituents of a multimodal communication should be generated on the fly from a common representation of what is to be conveyed, without using any preplanned text or images. An important goal of such systems is not simply to merge the verbalization and visualization results of a text generator and a graphics generator, but to coordinate them so carefully that they produce a multiplicative improvement in communication capabilities. Such multimodal presentation systems are highly adaptive, since all presentation decisions are postponed until runtime. This quest for adaptation rests on the fact that it is impossible to anticipate the needs and requirements of each potential user in an infinite number of presentation situations.

The most advanced multimodal presentation systems, which generate text illustrated by 3-D graphics and animations, are COMET (Feiner & McKeown, 1993) and WIP (Wahlster, André, et al., 1993). COMET generates directions for maintenance and repair of a portable radio, while WIP designs multimodal explanations in German and English on using an espresso machine, assembling a lawn mower, or installing a modem.

Intelligent multimodal presentation systems include a number of key processes: content planning (determining what information should be presented in a given situation), mode selection (apportioning the selected information to text and graphics), presentation design (determining how text and graphics can be used to communicate the selected information), and coordination (resolving conflicts and maintaining consistency between text and graphics).
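To make these four processes concrete, the following minimal Python sketch shows one way such a pipeline could be organized. It is an illustration only: the classes and functions (Fact, plan_content, select_mode, design_and_coordinate) are hypothetical and do not reproduce the actual COMET or WIP architectures.

```python
# Hypothetical sketch of the four key processes of an intelligent multimodal
# presentation system; not the actual COMET or WIP design.
from dataclasses import dataclass, field

@dataclass
class Fact:
    """One piece of the formal representation of the information to convey."""
    predicate: str          # e.g., "push(modem.code-switch-S4, right)"
    spatial: bool = False   # spatial/visual content tends to favor graphics

@dataclass
class PresentationPlan:
    text_acts: list = field(default_factory=list)
    graphics_acts: list = field(default_factory=list)

def plan_content(goal: str, knowledge: list) -> list:
    """Content planning: decide WHAT to present for the given goal."""
    return [f for f in knowledge if goal in f.predicate]

def select_mode(facts: list) -> PresentationPlan:
    """Mode selection: apportion the selected information to text and graphics."""
    plan = PresentationPlan()
    for f in facts:
        (plan.graphics_acts if f.spatial else plan.text_acts).append(f)
    return plan

def design_and_coordinate(plan: PresentationPlan) -> PresentationPlan:
    """Presentation design and coordination: keep the two modes consistent,
    e.g., every object shown graphically is also referred to in the text."""
    for f in plan.graphics_acts:
        if f not in plan.text_acts:
            plan.text_acts.append(f)    # generate a cross-modal reference
    return plan

knowledge = [Fact("push(modem.code-switch-S4, right)", spatial=True),
             Fact("purpose(set modem to receive data)")]
print(design_and_coordinate(select_mode(plan_content("modem", knowledge))))
```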
Figure 9.12: A text-picture combination generated by the WIP system. (The generated instructions read: "Push the code switch S-4 to the right in order to set the modem for reception of data. Connect the telephone cable. Turn the on/off switch to the right in order to switch on the modem. After switching on the modem, the LED L-11 lights up.")

An important synergistic use of multimodality in systems generating text-picture combinations is the disambiguation of referring expressions. An accompanying picture often makes clear what the intended object of a referring expression is. For example, a technical name for an object unknown to the user may be introduced by clearly singling out the intended object in the accompanying illustration (Figure 9.12). In addition, WIP and COMET can generate cross-modal expressions such as "The on/off switch is shown in the upper left part of the picture" to establish referential relationships between representations in one modality and representations in another.

The research so far has shown that it is possible to adapt many of the fundamental concepts developed to date in computational linguistics in such a way that they become useful for text-picture combinations as well. In particular, semantic and pragmatic concepts like communicative acts, coherence, focus, reference, discourse model, user model, implicature, anaphora, rhetorical relations, and scope ambiguity take on an extended meaning in the context of multimodal communication.

9.3.4 Future Directions

Areas which require further investigation include: how to reason about multiple modes so that the system can block false implicatures and ensure that the generated text-picture combination is unambiguous; the role of layout as a rhetorical force influencing the intentional and attentional state of the viewer; the integration of facial animation and speech for the presentation agent; and the formalization of design knowledge for creating interactive presentations. Key applications for intellimedia systems are multimodal helpware, information retrieval and analysis, authoring, training, monitoring, and decision support.

9.4 Modality Integration: Speech and Gesture

Yacine Bellik
LIMSI-CNRS, Orsay, France

Speech and gestures are the means of expression most used in communication between human beings, and learning to use them begins in the first years of life. They should therefore be the privileged modalities for communicating with computers (Hauptmann & McAvinney, 1993). Compared to speech, research aiming to integrate gesture as a means of expression (not only as a means of manipulating objects) into Human-Computer Interaction (HCI) has begun only recently. This work was made possible by the appearance of new devices, in particular datagloves, which report the configuration of the hand (the flexing angles of the fingers) at any moment and track its position in 3D space.

Multimodality aims not only at making several modalities cohabit in an interactive system, but especially at making them cooperate (Coutaz, Nigay, et al., 1993; Salisbury, 1990). For instance, if the user wants to move an object using a speech recognition system and a touch screen, as in Figure 9.13, he has only to say "put that there" while pointing at the object and at its new position (Bolt, 1980). In human communication, the use of speech and gestures is completely coordinated. Unfortunately, and unlike human means of communication, the devices used to interact with computers have not been designed to cooperate. For instance, the difference between the response times of devices can be very large: a speech recognition system needs more time to recognize a word than a touch-screen driver needs to compute the coordinates of a pointing gesture. As a consequence, the system receives the information stream in an order which does not correspond to the real chronological order of the user's actions (like a sentence in which the words have been shuffled), and this can lead to misinterpretation of the user's statements.

The fusion of information issued from speech and gesture constitutes a major problem. Which criteria should be used to decide whether one piece of information should be fused with another, and at what level of abstraction should this fusion be done? On the one hand, fusion at the lexical level allows generic multimodal interface tools to be designed, though fusion errors may occur. On the other hand, fusion at the semantic level is more robust because it exploits many more criteria, but it is in general application-dependent.
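As an illustration of fusion above the purely lexical level, the sketch below pairs the deictic words of a recognized "put that there" utterance with timestamped pointing gestures by temporal proximity (the importance of timestamps is taken up below). The event formats and the tolerance value are assumptions made for the example, not a standard interface.

```python
# Minimal sketch: fusing "put that there" with two timestamped pointing
# gestures. Event formats and the time tolerance are assumed for the example.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    t: float            # timestamp in seconds

@dataclass
class Pointing:
    x: float
    y: float
    t: float

def fuse(words, gestures, max_skew=1.0):
    """Pair each deictic word with the closest-in-time unused gesture."""
    free = list(gestures)
    referents = {}
    for w in words:
        if w.text not in ("that", "there"):
            continue
        best = min(free, key=lambda g: abs(g.t - w.t), default=None)
        if best is None or abs(best.t - w.t) > max_skew:
            raise ValueError(f"no gesture found for deictic '{w.text}'")
        referents[w.text] = (best.x, best.y)
        free.remove(best)
    return referents

# The user says "put that there" while touching the object, then the target.
utterance = [Word("put", 0.10), Word("that", 0.45), Word("there", 1.20)]
touches = [Pointing(40, 120, 0.50), Pointing(300, 80, 1.15)]
print(fuse(utterance, touches))     # {'that': (40, 120), 'there': (300, 80)}
```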
It is also important to handle possible semantic conflicts between speech and gesture and to exploit information redundancy when it occurs.

Time is an important factor in interfaces which integrate speech and gesture (Bellik, 1995). It is one of the basic criteria (necessary but not sufficient) for the fusion process, and it allows the real chronological order of the information to be reconstituted. It is therefore necessary to assign dates (timestamps) to all messages (words, gestures, etc.) produced by the user.

Figure 9.13: Working with a multimodal interface including speech and gesture. The user speaks while pointing on the touch screen to manipulate the objects. The time correlation of pointing gestures and spoken utterances is important for determining the meaning of his action.

It is also important to take into account the characteristics of each modality (Bernsen, 1993) and their technological constraints. For instance, operations which require high security should be assigned to the modalities with the lowest recognition-error risks, or should demand redundancy to reduce those risks. It can be necessary to define a multimodal grammar. Ideally, this grammar should also take into account other parameters such as the user's state, the current task, and the environment (for instance, a high noise level will prohibit the use of speech).
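The following toy sketch shows how such constraints might be written down as a small multimodal grammar plus an environment check; the command names, the redundancy rule, and the noise threshold are all invented for illustration.

```python
# Toy sketch of a multimodal grammar: which modality combinations are
# acceptable for a command, given the environment. Command names, the
# redundancy requirement, and the noise threshold are invented examples.
RULES = {
    # command: (list of allowed modality sets, redundancy required?)
    "move_object": ([{"speech", "touch"}, {"touch"}], False),
    "delete_all":  ([{"speech", "touch"}],            True),   # high security
    "zoom":        ([{"speech"}, {"touch"}],          False),
}

def admissible(command, modalities, noise_db):
    """Check a user action against the grammar and the current environment."""
    allowed, needs_redundancy = RULES[command]
    if noise_db > 70 and "speech" in modalities:   # noisy room: no speech
        return False
    if needs_redundancy and len(modalities) < 2:   # demand redundant input
        return False
    return modalities in allowed                   # set equality against the grammar

print(admissible("delete_all", {"speech", "touch"}, noise_db=40))   # True
print(admissible("delete_all", {"speech"}, noise_db=40))            # False
print(admissible("move_object", {"touch"}, noise_db=85))            # True
```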
Future Directions

The effectiveness of a multimodal interface depends in large part on the performance of each modality taken separately. While remarkable progress has been accomplished in speech processing, more effort is needed to improve gesture recognition systems, in particular for continuous gestures. Systems with touch feedback and/or force feedback, which are becoming more and more numerous, will in the near future improve the comfort of gesture use, in particular for 3D applications.

9.5 Modality Integration: Facial Movement & Speech Recognition

Alan J. Goldschen
Center of Innovative Technology, Herndon, Virginia, USA

9.5.1 Background

A machine should be capable of performing automatic speech recognition through the use of several knowledge sources, analogous, to a certain extent, to those that humans use (Erman & Lesser, 1990). Current speech recognizers use only acoustic information from the speaker, and in noisy environments often use secondary knowledge sources such as a grammar and prosody. One source of secondary information that has largely been ignored is optical information (from the face, and in particular the oral-cavity region of a speaker), which often carries information redundant with the acoustic information and is often not corrupted by the processes that cause the acoustic noise (Silsbee, 1993). In noisy environments, humans rely on a combination of speech (acoustic) and visual (optical) sources, and this combination improves the signal-to-noise ratio by a gain of 10 to 12 dB (Brooke, 1990). Analogously, machine recognition should improve when combining the acoustic source with an optical source that contains information from the facial region such as gestures, expressions, head position, eyebrows, eyes, ears, mouth, teeth, tongue, cheeks, jaw, neck, and hair (Pelachaud, Badler, et al., 1994). Human facial expressions provide information about emotion (anger, surprise), truthfulness, temperament (hostility), and personality (shyness) (Ekman, Huang, et al., 1993).

Furthermore, human speech production and facial expression are inherently linked by a synchrony phenomenon, in which changes often occur simultaneously in speech and facial movements (Pelachaud, Badler, et al., 1994; Condon & Osgton, 1971). An eye blink may occur at the beginning or end of a word, while oral-cavity movements may cease at the end of a sentence.

In human speech perception experiments, the optical information is complementary to the acoustic information, because many of the phones that are close to each other acoustically are very distant from each other visually (Summerfield, 1987). Visually similar phones such as /p/, /b/, /m/ form a viseme, the specific oral-cavity movement that corresponds to a set of phones (Fisher, 1968). It appears that the consonant phone-to-viseme mapping is many-to-one (Finn, 1986; Goldschen, 1993), while the vowel phone-to-viseme mapping is nearly one-to-one (Goldschen, 1993). For example, the phone /p/ appears visually similar to the phones /b/ and /m/, whereas at a signal-to-noise ratio of zero /p/ is acoustically similar to the phones /t/, /k/, /f/, /th/, and /s/ (Summerfield, 1987). Using both sources of information, humans (or machines) can determine the phone /p/. However, this fusion of acoustic and optical sources sometimes causes humans to perceive a phone different from either the acoustically or the optically presented phone, a phenomenon known as the McGurk effect (McGurk & MacDonald, 1976). In general, the perception of speech in noise improves greatly when acoustic and optical sources are presented together, because of the complementarity of the two sources.
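The many-to-one consonant mapping can be made concrete with a small lookup table; the grouping below is a simplified illustration in the spirit of the viseme classes discussed above, not an exact reproduction of any published inventory.

```python
# Simplified illustration of a many-to-one consonant phone-to-viseme mapping;
# the grouping is illustrative, not an exact published inventory.
VISEME_OF = {
    "p": "bilabial",    "b": "bilabial",    "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "th": "dental",     "dh": "dental",
    "t": "alveolar",    "d": "alveolar",    "s": "alveolar", "z": "alveolar",
    "k": "velar",       "g": "velar",
}

def visually_confusable(phone_a, phone_b):
    """Two phones are hard to tell apart by sight if they share a viseme."""
    return VISEME_OF.get(phone_a) == VISEME_OF.get(phone_b)

# /p/, /b/ and /m/ collapse to one viseme, but /p/ and /t/ stay visually
# distinct, which is exactly the information a noisy acoustic recognizer lacks.
print(visually_confusable("p", "b"))   # True
print(visually_confusable("p", "t"))   # False
```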
9.5.2 Systems

Some speech researchers are developing systems that use the complementary acoustic and optical sources of information to improve their acoustic recognizers, especially in noisy environments. These systems primarily focus on integrating optical information from the oral-cavity region of a speaker (automatic lipreading) with acoustic information. The acoustic source often consists of a sequence of vectors containing linear predictive coefficients or filter bank coefficients, or some variation of these (Rabiner & Schafer, 1978; Deller, Proakis, et al., 1993). The optical source consists of a sequence of vectors containing static oral-cavity features such as the area, perimeter, height, and width of the oral cavity (Petajan, 1984; Petajan, Bischoff, et al., 1988), jaw opening (Stork, Wolff, et al., 1992), lip rounding, and the number of regions or blobs in the oral cavity (Goldschen, 1993; Garcia, Goldschen, et al., 1992; Goldschen, Garcia, et al., 1994). Other researchers model the dynamic movements of the oral cavity using derivatives (Goldschen, 1993; Smith, 1989; Nishida, 1986), surface learning (Bregler, Omohundro, et al., 1994), deformable templates (Hennecke, Prasad, et al., 1994; Rao & Mersereau, 1994), or optical flow techniques (Pentland & Mase, 1989; Mase & Pentland, 1991).

There have been two basic approaches to building a system that uses both acoustic and optical information. The first approach uses a comparator to merge the two independently recognized acoustic and optical events. This comparator may consist of a set of rules (e.g., if the top two phones from the acoustic recognizer are /t/ or /p/, then choose the one that has the higher ranking from the optical recognizer) (Petajan, Bischoff, et al., 1988) or a fuzzy logic integrator (e.g., one providing linear weights associated with the acoustically and optically recognized phones) (Silsbee, 1993; Silsbee, 1994). The second approach performs recognition using a vector that includes both acoustic and optical information; such systems typically use neural networks to combine the optical information with the acoustic information to improve the signal-to-noise ratio before phonemic recognition (Yuhas, Goldstein, et al., 1989; Bregler, Omohundro, et al., 1994; Bregler, Hild, et al., 1993; Stork, Wolff, et al., 1992; Silsbee, 1994).
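In the spirit of the first, decision-level approach, the sketch below combines per-phone scores from an acoustic and an optical recognizer with a single linear weight; the scores, the weight, and the phone sets are invented for the illustration and stand in for the rule-based or fuzzy-logic integrators cited above.

```python
# Minimal sketch of late (decision-level) fusion: combine per-phone scores
# from an acoustic and an optical recognizer with one linear weight.
# Scores and the weight are invented for the illustration.
def fuse_scores(acoustic, optical, w_acoustic=0.6):
    """Return the phone with the best weighted combined score."""
    phones = set(acoustic) | set(optical)
    combined = {p: w_acoustic * acoustic.get(p, 0.0)
                   + (1.0 - w_acoustic) * optical.get(p, 0.0)
                for p in phones}
    return max(combined, key=combined.get)

# At 0 dB SNR the acoustic recognizer confuses /p/ with /t/ and /k/,
# but the lips rule the non-bilabials out.
acoustic_scores = {"p": 0.31, "t": 0.34, "k": 0.30}
optical_scores  = {"p": 0.55, "b": 0.30, "m": 0.15}
print(fuse_scores(acoustic_scores, optical_scores))   # 'p'
```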
Regardless of the signal-to-noise ratio, most systems perform better using both acoustic and optical sources of information than when using only one source (Bregler, Omohundro, et al., 1994; Bregler, Hild, et al., 1993; Mak & Allen, 1994; Petajan, 1984; Petajan, Bischoff, et al., 1988; Silsbee, 1994; Silsbee, 1993; Smith, 1989; Stork, Wolff, et al., 1992; Yuhas, Goldstein, et al., 1989). At a signal-to-noise ratio of zero on a 500-word task, Silsbee (1993) achieves word recognition accuracies of 38%, 22%, and 58%, respectively, using acoustic information, optical information, and both sources of information. Similarly, for a German alphabetic letter recognition task, Bregler, Hild, et al. (1993) achieve recognition accuracies of 47%, 32%, and 77%, respectively, using acoustic information, optical information, and both sources of information.

9.5.3 Future Directions

In summary, most current systems use an optical source containing information from the oral-cavity region of the speaker (lipreading) to improve the robustness of the information from the acoustic source. Future systems will likely improve this optical source and use additional features from the facial region.

9.6 Modality Integration: Facial Movement & Speech Synthesis

Christian Benoît (a), Dominic W. Massaro (b), & Michael M. Cohen (b)
(a) Université Stendhal, Grenoble, France
(b) University of California, Santa Cruz, California, USA

There is valuable and effective information afforded by a view of the speaker's face in speech perception and recognition by humans. Visible speech is particularly effective when the auditory speech is degraded because of noise, bandwidth filtering, or hearing impairment (Sumby & Pollack, 1954; Erber, 1975; Summerfield, 1979; Massaro, 1987; Benoît, Mohamadi, et al., 1994). The strong influence of visible speech is not limited to situations with degraded auditory input, however. A perceiver's recognition of an auditory-visual syllable reflects the contribution of both sound and sight. When an auditory syllable /ba/ is dubbed onto a videotape of a speaker saying /ga/, subjects perceive the speaker to be saying /da/ (McGurk & MacDonald, 1976).

There is thus evidence that (1) synthetic faces increase the intelligibility of synthetic speech, but (2) only under the condition that facial gestures and speech sounds are coherent. To reach this goal, the articulatory parameters of the facial animation have to be controlled so that it looks and sounds as if the auditory output were generated by the visible movements of the articulators. Desynchronization or incoherence between the two modalities not only fails to increase speech intelligibility; it may even decrease it.

Most of the existing parametric models of the human face have been developed with a view to optimizing the visual rendering of facial expressions (Parke, 1974; Platt & Badler, 1981; Bergeron & Lachapelle, 1985; Waters, 1987; Magnenat-Thalmann, Primeau, et al., 1988; Viaud & Yahia, 1992). Few models have focused on the specific articulation of speech gestures: Saintourens, Tramus, et al. (1990), Benoît, Lallouache, et al. (1992), and Henton and Litwinovitz (1994) prestored a limited set of facial images occurring in the natural production of speech in order to synchronize the processes of diphone concatenation and viseme display in a text-to-audio-visual speech synthesizer. Ultimately, coarticulation effects and transition smoothing are much more naturally simulated by means of parametric models specially controlled for visual speech animation, such as the 3-D lip model developed by Guiard-Marigny, Adjoudani, et al. (1994) or the 3-D model of the whole face adapted to speech control by Cohen and Massaro (1990). These two models are displayed in Figure 9.14.

Figure 9.14: Left panel: Gouraud shading of the face model originally developed by Parke (1974) and adapted to speech gestures by Cohen and Massaro (1993); a dozen parameters allow the synthetic face to be correctly controlled for speech. Right panel: wireframe structure of the 3-D model of the lips developed by Guiard-Marigny, Adjoudani, et al. (1994); the internal and external contours of the model can take all the possible shapes of natural lips speaking with a neutral expression.
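To give a flavor of what controlling articulatory parameters in synchrony with the speech means, the sketch below turns a synthesizer's phoneme timing into lip-parameter keyframes and interpolates linearly between them; the parameter names and target values are invented and far cruder than the coarticulation modeling of Cohen and Massaro (1993).

```python
# Sketch: drive two lip parameters from a synthesizer's phoneme timing by
# keyframing a viseme target per phoneme and interpolating linearly.
# Parameter names and target values are invented for the illustration.
VISEME_TARGETS = {                    # (lip_opening, lip_rounding) in [0, 1]
    "m": (0.0, 0.3), "a": (0.9, 0.1), "u": (0.3, 0.9), "sil": (0.1, 0.2),
}

def keyframes(phoneme_timing):
    """phoneme_timing: list of (phoneme, start_time_in_seconds)."""
    return [(t, VISEME_TARGETS.get(ph, VISEME_TARGETS["sil"]))
            for ph, t in phoneme_timing]

def face_params_at(time_s, frames):
    """Linear interpolation between the surrounding keyframes."""
    frames = sorted(frames)
    for (t0, p0), (t1, p1) in zip(frames, frames[1:]):
        if t0 <= time_s <= t1:
            a = (time_s - t0) / (t1 - t0)
            return tuple((1 - a) * x0 + a * x1 for x0, x1 in zip(p0, p1))
    return frames[-1][1]

timing = [("sil", 0.00), ("m", 0.10), ("a", 0.25), ("u", 0.45), ("sil", 0.60)]
frames = keyframes(timing)
for t in (0.10, 0.18, 0.30):          # a render loop would query every video frame
    print(t, face_params_at(t, frames))
```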
A significant gain in intelligibility due to a coherent animation of a synthetic face has indeed been obtained at the University of California, Santa Cruz, by improving the Parke model (Cohen & Massaro, 1993) and then synchronizing it with the MITalk rule-based speech synthesizer (even though no quantitative measurements are yet available). In parallel, intelligibility tests have been carried out at the ICP in Grenoble to compare the benefit of seeing the natural face, a synthetic face, or synthetic lips while listening to natural speech under various conditions of acoustic degradation (Goff, Guiard-Marigny, et al., 1994). Whatever the degradation level, two thirds of the missing information is compensated by the view of the entire speaker's face; half is compensated by the view of a synthetic face controlled through six parameters directly measured on the original speaker's face; and a third is compensated by the view of a 3-D model of the lips, controlled through only four of these command parameters (without seeing the teeth, the tongue, or the jaw).

All these findings support the expectation of technological spin-offs in two main areas of application. On the one hand, even though the quality of some text-to-speech synthesizers is now such that simple messages are very intelligible when synthesized in clear acoustic conditions, this is no longer the case when the message is less predictable (proper names, numbers, complex sentences, etc.) or when the speech synthesizer is used in a natural environment (e.g., over the telephone network or in public places with background noise). In such conditions, the display of a synthetic face coherently animated in synchrony with the synthetic speech makes the synthesizer sound more intelligible and look more pleasant and natural. On the other hand, the quality of computer graphics rendering is now such that human faces can be imitated very naturally. Today, audiences no longer accept synthetic actors behaving as if their voices were dubbed from another language. There is thus strong pressure from the movie and entertainment industry to automate the lip-synchronization process so that actors' facial gestures look natural.

Future Directions

To conclude, research in the area of visible speech is a fruitful paradigm for psychological inquiry (Massaro, 1987). Video analysis of human faces is a simple investigation technique which allows a better understanding of how speech is produced by humans (Abry & Lallouache, 1991). Face and lip modeling allows experimenters to manipulate controlled stimuli and to evaluate hypotheses and descriptive parametrizations in terms of the visual and bimodal intelligibility of speech. Finally, the bimodal integration of facial animation and acoustic synthesis is a fascinating challenge for a better description and comprehension of each language in which this technology is developed. It is also a necessary and promising step towards the realization of autonomous agents in human-machine virtual interfaces.

9.7 Chapter References

Abry, C. and Lallouache, M. T. (1991). Audibility and stability of articulatory movements: Deciphering two experiments on anticipatory rounding in French. In Proceedings of the 12th International Congress of Phonetic Sciences, volume 1, pages 220-225, Aix-en-Provence, France.
Allen, J. (1983). Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832-843.
Anger, F. D., Gusgen, H. W., and van Benthem, J., editors (1993). Proceedings of the IJCAI-93 Workshop on Spatial and Temporal Reasoning (W17), Chambéry, France.
Asilomar (1994). Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers. IEEE.
Aurnague, M. and Vieu, L. (1993). A logical framework for reasoning about space. In Anger, F. D., Gusgen, H. W., and van Benthem, J., editors, Proceedings of the IJCAI-93 Workshop on Spatial and Temporal Reasoning (W17), pages 123-158, Chambéry, France.
Badler, N. I., Phillips, C. B., and Webber, B. L. (1993). Simulating Humans: Computer Graphics Animation and Control. Oxford University Press, New York.
Bajcsy, R., Joshi, A., Krotkov, E., and Zwarico, A. (1985). LandScan: A natural language and computer vision system for analyzing aerial images. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, pages 919-921, Los Angeles.
Bellik, Y. (1995). Interfaces Multimodales: Concepts, Modèles et Architectures. PhD thesis, Université d'Orsay, Paris.
Benoît, C., Lallouache, M. T., Mohamadi, T., and Abry, C. (1992). A set of French visemes for visual speech synthesis. In Bailly, G. and Benoît, C., editors, Talking Machines: Theories, Models, and Designs, pages 485-504. Elsevier Science.
Benoît, C., Mohamadi, T., and Kandel, S. (1994). Effects of phonetic context on audio-visual intelligibility of French. Journal of Speech and Hearing Research, 37:1195-1203.
Bergeron, P. and Lachapelle, P. (1985). Controlling facial expressions and body movements in the computer generated animated short 'Tony de Peltrie'. In SIGGRAPH '85 Tutorial Notes.
Berkley, D. A. and Flanagan, J. L. (1990). HuMaNet: An experimental human/machine communication network based on ISDN. AT&T Technical Journal, 69:87-98.
Bernsen, N. O. (1993). Modality theory: Supporting multimodal interface design. In Proceedings of the ERCIM Workshop on Human-Computer Interaction, Nancy.
Blonder, G. E. and Boie, R. A. (1992). Capacitive moments sensing for electronic paper. U.S. Patent 5 113 041.
Bolt, R. A. (1980). Put-that-there: Voice and gesture at the graphic interface. Computer Graphics, 14(3):262-270.
Bregler, C., Hild, H., Manke, S., and Waibel, A. (1993). Improving connected letter recognition by lipreading. In Proceedings of the 1993 International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 557-560. IEEE.
Bregler, C., Omohundro, S., and Konig, Y. (1994). A hybrid approach to bimodal speech recognition. In Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers. IEEE.
Brooke, N. M. (1990). Visible speech signals: Investigating their analysis, synthesis and perception. In Taylor, M. M., Neel, F., and Bouwhuis, D. G., editors, The Structure of Multimodal Dialogue. Elsevier Science, Amsterdam.
Brooks, F., Ouh-Young, M., Batter, J., and Jerome, P. (1990). Project GROPE: Haptic displays for scientific visualization. Computer Graphics, 24(4):177-185.
Burdea, G. and Coiffet, P. (1994). Virtual Reality Technology. John Wiley, New York.
Burdea, G. and Zhuang, J. (1991). Dextrous telerobotics with force feedback. Robotica, 9(1 & 2):171-178; 291-298.
Cherry, C. (1957). On Human Communication. Wiley, New York.
Cohen, M. M. and Massaro, D. W. (1990). Synthesis of visible speech. Behaviour Research Methods, Instruments & Computers, 22(2):260-263.
Cohen, M. M. and Massaro, D. W. (1993). Modeling coarticulation in synthetic visual speech. In Thalmann, N. M. and Thalmann, D., editors, Models and Techniques in Computer Animation, pages 139-156. Springer-Verlag, Tokyo.
Cohn, A. (1993). Modal and non-modal qualitative spatial logics. In Anger, F. D., Gusgen, H. W., and van Benthem, J., editors, Proceedings of the IJCAI-93 Workshop on Spatial and Temporal Reasoning (W17), pages 87-92, Chambéry, France.
Condon, W. and Osgton, W. (1971). Speech and body motion synchrony of the speaker-hearer. In Horton, D. and Jenkins, J., editors, The Perception of Language, pages 150-184. Academic Press.
COSIT (1993). Proceedings of the European Conference on Spatial Information Theory (COSIT'93), volume 716 of Lecture Notes in Computer Science. Springer-Verlag.
Coutaz, J., Nigay, L., and Salber, D. (1993). The MSM framework: A design space for multi-sensori-motor systems. In Bass, L., Gornostaev, J., and Under, C., editors, Lecture Notes in Computer Science, Selected Papers, EWCHI'93, East-West Human Computer Interaction, pages 231-241. Springer-Verlag, Moscow.
Deller, J. R., Jr., Proakis, J. G., and Hansen, J. H. (1993). Discrete-Time Processing of Speech Signals. Macmillan.
Ekman, P., Huang, T., Sejnowski, T., and Hager, J. (1993). Final report to NSF of the planning workshop on facial expression understanding (July 30 to August 1, 1992). Technical report, University of California, San Francisco.
Erber, N. P. (1975). Auditory-visual perception of speech. Journal of Speech and Hearing Disorders, 40:481-492.
Erman, L. and Lesser, V. (1990). The Hearsay-II speech understanding system: A tutorial. In Readings in Speech Recognition, pages 235-245. Morgan Kaufmann.
ESCA (1994). Proceedings of the Second ESCA/IEEE Workshop on Speech Synthesis, New Paltz, New York. European Speech Communication Association.
Feiner, S. K. and McKeown, K. R. (1993). Automating the generation of coordinated multimedia explanations. In Maybury, M. T., editor, Intelligent Multimedia Interfaces, pages 117-138. AAAI Press, Menlo Park, California.
Finn, K. (1986). An Investigation of Visible Lip Information to be Used in Automatic Speech Recognition. PhD thesis, Georgetown University.
Fisher, C. G. (1968). Confusions among visually perceived consonants. Journal of Speech and Hearing Research, 11:796-804.
Flanagan, J. L. (1992). Technologies for multimedia information systems. Transactions, Institute of Electronics, Information and Communication Engineers, 75(2):164-178.
Flanagan, J. L. (1994). Technologies for multimedia communications. Proceedings of the IEEE, 82(4):590-603.
Flanagan, J. L., Surendran, A. C., and Jan, E. E. (1993). Spatially selective sound capture for speech and audio processing. Speech Communication, 13:207-222.
Frank, A. U., Campari, I., and Formentini, U., editors (1992). Proceedings of the International Conference GIS - From Space to Territory: Theories and Methods of Spatio-Temporal Reasoning, number 639 in Lecture Notes in Computer Science, Pisa, Italy. Springer-Verlag.
Fraser, A. G., Kalmanek, C. R., Kaplan, A. E., Marshall, W. T., and Restrick, R. C. (1992). XUNET 2: A nationwide testbed in high-speed networking. In INFOCOM 92, Florence, Italy.
Furui, S. (1989). Digital Speech Processing, Synthesis, and Recognition. Marcel Dekker, New York.
Garcia, O., Goldschen, A., and Petajan, E. (1992). Feature extraction for optical automatic speech recognition or automatic lipreading. Technical Report GWU-IIST-92-32, The George Washington University, Department of Electrical Engineering and Computer Science.
Goff, B. L., Guiard-Marigny, T., Cohen, M., and Benoît, C. (1994). Real-time analysis-synthesis and intelligibility of talking faces. In Proceedings of the Second ESCA/IEEE Workshop on Speech Synthesis, pages 53-56, New Paltz, New York. European Speech Communication Association.
Goldschen, A. (1993). Continuous Automatic Speech Recognition by Lipreading. PhD thesis, The George Washington University, Washington, DC.
Goldschen, A., Garcia, O., and Petajan, E. (1994). Continuous optical automatic speech recognition. In Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers. IEEE.
Guiard-Marigny, T., Adjoudani, A., and Benoît, C. (1994). A 3-D model of the lips for visual speech synthesis. In Proceedings of the Second ESCA/IEEE Workshop on Speech Synthesis, pages 49-52, New Paltz, New York. European Speech Communication Association.
Hauptmann, A. G. and McAvinney, P. (1993). Gestures with speech for graphic manipulation. International Journal of Man-Machine Studies, 38(2):231-249.
Hennecke, M., Prasad, K., and Stork, D. (1994). Using deformable templates to infer visual speech dynamics. In Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers. IEEE.
Henton, C. and Litwinovitz, P. (1994). Saying and seeing it with feeling: Techniques for synthesizing visible, emotional speech. In Proceedings of the Second ESCA/IEEE Workshop on Speech Synthesis, pages 73-76, New Paltz, New York. European Speech Communication Association.
Herkovits, A. (1986). Language and Cognition. Cambridge University Press, New York.
ICP (1993). Bulletin de la communication parlée, 2. Université Stendhal, Grenoble, France.
Keidel, W. D. (1968). Information processing by sensory modalities in man. In Cybernetic Problems in Bionics, pages 277-300. Gordon and Breach.
Lascarides, A. and Asher, N. (1993). Maintaining knowledge about temporal intervals. Linguistics and Philosophy, 16(5):437-493.
Ligozat, G. (1993). Models for qualitative spatial reasoning. In Anger, F. D., Gusgen, H. W., and van Benthem, J., editors, Proceedings of the IJCAI-93 Workshop on Spatial and Temporal Reasoning (W17), pages 35-45, Chambéry, France.
Magnenat-Thalmann, N., Primeau, E., and Thalmann, D. (1988). Abstract muscle action procedures for human face animation. Visual Computer, 3:290-297.
Mak, M. W. and Allen, W. G. (1994). Lip-motion analysis for speech segmentation in noise. Speech Communication, 14:279-296.
Mariani, J., Teil, D., and Silva, O. D. (1992). Gesture recognition. LIMSI Report, Centre National de la Recherche Scientifique, Orsay, France.
Mark, D. M. and Frank, A. U., editors (1991). Cognitive and Linguistic Aspects of Geographic Space. NATO Advanced Studies Institute, Kluwer, Dordrecht.
Mase, K. and Pentland, A. (1991). Automatic lipreading by optical flow analysis. Systems and Computers in Japan, 22(6):67-76.
Massaro, D. W. (1987). Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry. Lawrence Erlbaum, Hillsdale, New Jersey.
Maybury, M. T., editor (1993). Intelligent Multimedia Interfaces. AAAI Press, Menlo Park, California.
McDermott, D. (1982). A temporal logic for reasoning about processes and plans. Cognitive Science, 6:101-155.
McGurk, H. and MacDonald, J. (1976). Hearing lips and seeing voices. Nature, 264:746-748.
McKevitt, P. (1994). The integration of natural language and vision processing. Artificial Intelligence Review Journal, 8:1-3. Special volume.
Moens, M. and Steedman, M. J. (1988). Temporal ontology and temporal reference. Computational Linguistics, 14(2):15-28.
Nakhimovsky, A. (1988). Aspect, aspectual class, and the temporal structure of narrative. Computational Linguistics, 14(2):29-43.
Nebel, B. and Bürckert, H.-J. (1993). Reasoning about temporal relations: A maximal tractable subclass of Allen's interval algebra. Technical Report RR-93-11, DFKI, Saarbrücken, Germany.
Netravali, A. and Haskel, B. (1988). Digital Pictures. Plenum Press, New York.
Neumann, B. (1989). Natural language description of time-varying scenes. In Waltz, D., editor, Semantic Structures, pages 167-207. Lawrence Erlbaum, Hillsdale, New Jersey.
Nishida (1986). Speech recognition enhancement by lip information. ACM SIGCHI Bulletin, 17(4):198-204.
Parke, F. I. (1974). A parametric model for human faces. PhD thesis, University of Utah, Department of Computer Sciences.
Pelachaud, C., Badler, N., and Viaud, M.-L. (1994). Final report to NSF of the standards for facial animation workshop. Technical report, University of Pennsylvania, Philadelphia.
Pentland, A. and Mase, K. (1989). Lip reading: Automatic visual recognition of spoken words. Technical Report 117, MIT Media Lab Vision Science, Massachusetts Institute of Technology.
Petajan, E. (1984). Automatic Lipreading to Enhance Speech Recognition. PhD thesis, University of Illinois at Urbana-Champaign.
Petajan, E., Bischoff, B., Bodoff, D., and Brooke, N. M. (1988). An improved automatic lipreading system to enhance speech recognition. In CHI 88, pages 19-25.
Pierce, J. R. (1961). Symbols, Signals and Noise. Harper and Row, New York.
Platt, S. M. and Badler, N. I. (1981). Animating facial expressions. Computer Graphics, 15(3):245-252.
Podilchuk, C. and Farvardin, N. (1991). Perceptually based low bit rate video coding. In Proceedings of the 1991 International Conference on Acoustics, Speech, and Signal Processing, volume 4, pages 2837-2840, Toronto. Institute of Electrical and Electronic Engineers.
Podilchuk, C., Jayant, N. S., and Noll, P. (1990). Sparse codebooks for the quantization of non-dominant sub-bands in image coding. In Proceedings of the 1990 International Conference on Acoustics, Speech, and Signal Processing, pages 2101-2104, Albuquerque, New Mexico. Institute of Electrical and Electronic Engineers.
Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286.
Rabiner, L. R. and Schafer, R. W. (1978). Digital Processing of Speech Signals. Signal Processing. Prentice-Hall, Englewood Cliffs, New Jersey.
Rao, R. and Mersereau, R. (1994). Lip modeling for visual speech recognition. In Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers. IEEE.
Roe, D. B., Moreno, P. J., Sproat, R. W., Pereira, F. C. N., Riley, M. D., and Macarron, A. (1992). A spoken language translator for restricted-domain context-free languages. Speech Communication, 11:311-319. System demonstrated by AT&T Bell Labs and Telefónica de España, VEST, World's Fair Exposition, Barcelona, Spain.
Saintourens, M., Tramus, M. H., Huitric, H., and Nahas, M. (1990). Creation of a synthetic face speaking in real time with a synthetic voice. In Bailly, G. and Benoît, C., editors, Proceedings of the First ESCA Workshop on Speech Synthesis, pages 249-252, Autrans, France. European Speech Communication Association.
Salisbury, M. W. (1990). Talk and draw: Bundling speech and graphics. IEEE Computer, pages 59-65.
Schirra, J. and Stopp, E. (1993). ANTLIMA - a listener model with mental images. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 175-180, Chambéry, France.
Silsbee, P. (1993). Computer Lipreading for Improved Accuracy in Automatic Speech Recognition. PhD thesis, The University of Texas at Austin.
Silsbee, P. (1994). Sensory integration in audiovisual automatic speech recognition. In Proceedings of the 28th Asilomar Conference on Signals, Systems and Computers. IEEE.
Smith, S. (1989). Computer lip reading to augment automatic speech recognition. Speech Tech, pages 175-181.
Stork, D., Wolff, G., and Levine, E. (1992). Neural network lipreading system for improved speech recognition. In Proceedings of the 1992 International Joint Conference on Neural Networks, Baltimore, Maryland.
Sumby, W. H. and Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26:212-215.
Summerfield, Q. (1979). Use of visual information for phonetic perception. Phonetica, 36:314-331.
Summerfield, Q. (1987). Some preliminaries to a comprehensive account of audio-visual speech perception. In Dodd, B. and Campbell, R., editors, Hearing by Eye: The Psychology of Lipreading, pages 3-51. Lawrence Erlbaum, Hillsdale, New Jersey.
Vandeloise, C. (1986). L'espace en français: sémantique des prépositions spatiales. Seuil, Paris.
Viaud, M. L. and Yahia, H. (1992). Facial animation with wrinkles. In Proceedings of the 3rd Workshop on Animation, Eurographics '92, Cambridge, England.
Vila, L. (1994). A survey on temporal reasoning in artificial intelligence. AICOM, 7(1):832-843.
Wahlster, W. (1989). One word says more than a thousand pictures: On the automatic verbalization of the results of image sequence analysis systems. Computers and Artificial Intelligence, 8:479-492.
Wahlster, W., André, E., Finkler, W., Profitlich, H.-J., and Rist, T. (1993). Plan-based integration of natural language and graphics generation. Artificial Intelligence, pages 387-427.
Wahlster, W., Marburger, H., Jameson, A., and Busemann, S. (1983). Over-answering yes-no questions: Extended responses in a NL interface to a vision system. In Proceedings of the 8th International Joint Conference on Artificial Intelligence, pages 643-646, Karlsruhe.
Waibel, A. (1993). Multimodal human-computer interaction. In Eurospeech '93, Proceedings of the Third European Conference on Speech Communication and Technology, volume Plenary, page 39, Berlin. European Speech Communication Association.
Waters, K. (1987). A muscle model for animating three-dimensional facial expression. In Proceedings of Computer Graphics, volume 21, pages 17-24.
Webber, B. L. (1988). Tense as discourse anaphor. Computational Linguistics, 14(2):61-73.
Wilpon, J., Rabiner, L., Lee, C., and Goldman, E. (1990). Automatic recognition of keywords in unconstrained speech using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 38(11):1870-1878.
Yamada, A., Yamamoto, T., Ikeda, H., Nishida, T., and Doshita, S. (1992). Reconstructing spatial images from natural language texts. In Proceedings of the 14th International Conference on Computational Linguistics, pages 1279-1283, Nantes, France. ACL.
Yuhas, B., Goldstein, M., and Sejnowski, T. (1989). Integration of acoustic and visual speech signals using neural networks. IEEE Communications Magazine, pages 65-71.


